A scheme is presented to extract detailed dynamical signatures from successive measurements of 
complex systems. Relative entropy based time series tools are used to quantify the gain in predictive 
power of increasing past knowledge. By lossy compression, data is represented by increasingly 
coarsened symbolic strings. Each compression resolution is modeled by a machine: a finite memory 
transition matrix. Applying the relative entropy tools to each machine's memory exposes correlations 
within many timescales. Examples are given for cardiac arrhythmias and different heart conditions 
are distinguished. 



Before understanding a complex system, one often needs to interpret the complex signals it generates. While it is easy to find correlations between arbitrarily separated pairs of points in a signal time series [1], highly correlated signals can lack such pairwise correlations (Section I). Instead, at a higher resolution, one can estimate the joint probabilities of sequences of events and, e.g., find the order of corresponding Markov chains. Unfortunately the set of possible event sequences will generically increase exponentially with sequence length while becoming proportionately harder to estimate; i.e. what these methods gain in resolution over pairwise statistics, they lose in range. We must, however, expect Nature to show correlations which are both long-ranged and more than pairwise.

This Letter suggests a tool, akin to the autocorrelation, which is intuitive, sensitive to more than pairwise correlations and yet is long-ranged enough to capture the long-time correlations shown by some complex systems. It combines two core ideas: 1) a natural measure of the predictive power one gains as one has an increasingly long symbolic string (Sec. II-V); 2) use of lossy compression to express the dynamics of a complex system as a set of Markov sources (transition matrices) with each one representing the dynamics on a different timescale (VI-VIII). The following considers a system which, at any time $t$, can be in state $x_t$ chosen from an alphabet (finite set) $\mathcal{A}$. The system passes through states $x_1, x_2, \ldots, x_t$ at fixed intervals and the data is ergodic and stationary. From now on $x_j, \ldots, x_k$ will be represented by $x_j^k$. Having addressed pairwise correlation measures in Section I, Sections II-VI develop a new means of mapping correlations in strings and VII-IX consider physiological examples and continuous time series.

I Pairwise Statistics Can Fail to Capture Structure. Any approach which investigates the time structure of a data string must be compared with conventional methods like the autocorrelation function. Pairwise measures, which compare a symbol at one point in a string with a (possibly different) symbol at another point, fail to capture behavior that is conditional on other, intervening symbols. The following one-parameter, order-two transition matrix with alphabet $\mathcal{A}_4 = \{A, B, C, D\}$ should remind the reader of this phenomenon; it can create strings without pairwise correlations. Using the notation that $X$ is a random variable and $x$ is a particular instantiation of that variable, it is of the form $p(X_3|X_1X_2)$ with ($x_i \in \mathcal{A}_4$):

where $\lambda_\pm = \frac{1}{4} \pm d$, the free parameter $d$ satisfies $0 < d < \frac{1}{4}$, and where $pq$ is any ordered pair, $p$ preceding $q$, other than $AA$, $BB$, $BA$, $AB$. One can prove that, for strings generated by this matrix, the probability of obtaining the symbol $u \in \mathcal{A}_4$ an interval $l \geq 1$ after $v \in \mathcal{A}_4$ is independent of both $l$ and $v$. The auto/cross correlations of such a series are thus indistinguishable from white noise (for explanations of symbolic autocorrelations see Voss [1]). However, the string is highly structured; measures like ${}_rH_{n\|m}$ below can expose this. Simple measures which reveal correlations between points, without having to create data objects which scale exponentially with order, can be very useful. The approach in Section VI yields data objects that both increase slowly with order/time and illuminate more than pairwise correlations.
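The claim that an order-two source can look like white noise to every pairwise statistic can be checked numerically. The sketch below uses a hypothetical matrix chosen only to satisfy the properties stated above (deviations $\lambda_\pm = \frac{1}{4} \pm d$ after the pairs $AA$, $BB$, $AB$, $BA$ and uniform rows otherwise); it is an assumption for illustration and need not coincide with the matrix of the text. The helper names `build_matrix`, `step` and `lagged_probs` are ours.

```python
from itertools import product

SYMS = "ABCD"

def build_matrix(d):
    """Hypothetical order-two matrix p(x3 | x1 x2); an assumption, not the text's."""
    lam_plus, lam_minus = 0.25 + d, 0.25 - d
    p = {a + b: dict.fromkeys(SYMS, 0.25) for a, b in product(SYMS, repeat=2)}
    for pair in ("AA", "BB"):
        p[pair].update(C=lam_plus, D=lam_minus)
    for pair in ("AB", "BA"):
        p[pair].update(C=lam_minus, D=lam_plus)
    return p

def step(dist, p):
    """Advance a distribution over pairs (x_{t-1}, x_t) by one symbol."""
    new = dict.fromkeys(dist, 0.0)
    for pair, weight in dist.items():
        for nxt, prob in p[pair].items():
            new[pair[1] + nxt] += weight * prob
    return new

def lagged_probs(p, v, max_lag):
    """P(symbol u at lag l | symbol v now) for l = 1..max_lag.

    Assumes the stationary pair distribution is uniform (checked below),
    so the symbol immediately following v is uniformly distributed.
    """
    dist = {a + b: (0.25 if a == v else 0.0) for a, b in product(SYMS, repeat=2)}
    rows = []
    for _ in range(max_lag):
        rows.append({u: sum(w for pair, w in dist.items() if pair[1] == u)
                     for u in SYMS})
        dist = step(dist, p)
    return rows

matrix = build_matrix(d=0.2)
uniform = dict.fromkeys(matrix, 1 / 16)
# Deviation of one update of the uniform pair distribution from itself.
stationary_error = max(abs(w - 1 / 16) for w in step(uniform, matrix).values())
# Largest deviation of any lagged pairwise probability from 1/4.
worst_pairwise = max(abs(q - 0.25)
                     for v in SYMS
                     for row in lagged_probs(matrix, v, 8)
                     for q in row.values())
print(stationary_error, worst_pairwise)      # both vanish up to rounding
print(matrix["AA"]["C"], matrix["AB"]["C"])  # triples are far from uniform
```

Every lagged pairwise probability equals 1/4 here, while the probability of $C$ after $AA$ differs strongly from that after $AB$: the structure lives entirely in triples.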

II The Transition Entropy. If a string is sufficiently long, one can estimate transition probabilities $p(X_t|X_{t-m}^{t-1})$. It should be stressed that this paper is not, in the first instance, about the estimation of these probabilities; we will assume that they are given to us exactly. The following entropies investigate the structure of this order $m$ transition matrix.

Call $H(X_t) = -\sum_{x_t \in \mathcal{A}} p(x_t)\log_2 p(x_t)$ the Shannon entropy of $X_t$. The transition, or conditional, entropy ${}_tH_m$ is defined as follows:
$${}_tH_m = -\sum_{x_{t-m}^{t}} p(x_{t-m}^{t}) \log_2 p(x_t|x_{t-m}^{t-1}) \qquad (2)$$

This transition entropy, ${}_tH_m$, measures the entropy of predictions one makes when equipped with a length $m$ string, when one does not know what the string is. If each length $m$ string that occurs exactly predicts the next state, then ${}_tH_m = 0$ (for the series ...ABABA... each length 1 string uniquely determines its ensuing state: ${}_tH_1 = 0$). If no string imparts a predictive advantage then ${}_tH_m = H(X_t)$ $\forall m$. Convexity arguments [1] show that ${}_tH_m \geq {}_tH_n$ ($n > m$).
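As a rough illustration of the transition entropy, the sketch below estimates it from a finite string as a difference of block entropies, ${}_tH_m = H_{m+1} - H_m$ (the conditional-entropy identity used later in the text). This is a naive plug-in estimate with no finite-size correction, whereas the text assumes exact probabilities; the function names are ours.

```python
from collections import Counter
from math import log2

def block_entropy(s, n):
    """Plug-in Shannon entropy (bits) of the length-n substrings of s."""
    if n == 0:
        return 0.0
    counts = Counter(s[i:i + n] for i in range(len(s) - n + 1))
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

def transition_entropy(s, m):
    """Estimate tH_m as H_{m+1} - H_m from empirical block frequencies."""
    return block_entropy(s, m + 1) - block_entropy(s, m)

series = "AB" * 500
print(transition_entropy(series, 0))  # entropy of a single symbol
print(transition_entropy(series, 1))  # near zero: one symbol fixes the next
```

For the alternating series the order-one estimate is zero only up to a small edge effect, because the string has one fewer "BA" than "AB".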

III The Relative Transition Entropy. The relative transition entropy, ${}_rH_{n\|m}$, defined below is a measure of the gain in predictive power as one moves from knowledge of a length $m$ string to a length $n$ string ($n > m$). The relative entropy or Kullback-Leibler divergence [4] between the distribution $Q(X)$ and $P(X)$, where $X$ can take $|\mathcal{A}|$ different values, is: $D(Q\|P) = \sum_{x \in \mathcal{A}} Q(x) \log_2 \frac{Q(x)}{P(x)}$. It is often described as the average disbelief in a model's predicted distribution $P$, when observing random outcomes $X$ from real data $Q$. By contrast, we will use it to capture the degree to which predictions made when equipped with more knowledge of the past, represented by $Q$, are inconsistent with those made with reduced knowledge, $P$.

One can compare predictions about a symbol at time $t$ given knowledge of a particular set of preceding $n$ symbols ($x_{t-n}^{t-1}$) with predictions made when given only the preceding $m$ ($x_{t-m}^{t-1}$ with $n > m$). The divergence, $D(p(X_t|x_{t-n}^{t-1})\|p(X_t|x_{t-m}^{t-1}))$, measures the information lost if one loses the knowledge that the sequence $x_{t-m}^{t-1}$ was preceded by $x_{t-n}^{t-m-1}$. Averaging over all strings $x_{t-n}^{t-1}$ yields the relative transition entropy ${}_rH_{n\|m}$:

$${}_rH_{n\|m} = \sum_{x_{t-n}^{t-1}} p(x_{t-n}^{t-1})\, D(p(X_t|x_{t-n}^{t-1})\|p(X_t|x_{t-m}^{t-1})) \qquad (3)$$

This quantifies the predictive power lost when one moves from having a length $n$ string of prior information to the shorter length $m$, for a randomly selected string $x_{t-n}^{t-1}$. Let us now establish a few properties of ${}_rH_{n\|m}$. Using Eqs. (2) and (3) one can readily prove that ${}_rH_{n\|m} = {}_tH_m - {}_tH_n$, $n > m$. If ${}_tH_m = {}_tH_n$ then ${}_rH_{n\|m} = 0$. Since $D(Q\|P) \geq 0$, with equality only when $Q = P$, we further know that if ${}_rH_{n\|m} = 0$ then $p(X_t|x_{t-n}^{t-1}) = p(X_t|x_{t-m}^{t-1})$ $\forall X_t, x_{t-n}^{t-1}$. The length $n$ and $m$ predictions are exactly the same. In general $D(Q\|P)$ can be unbounded but here, some thought shows that $0 \leq {}_rH_{n\|m} \leq H(X_t)$. We can now formulate a hierarchy of differential quantities. Defining the Shannon entropy for strings of length $n$ as $H_n = H(X_{t-n+1}^{t})$, one readily finds that ${}_tH_n = H_{n+1} - H_n$ and ${}_rH_{n\|m} = {}_tH_m - {}_tH_n$. Authors have noted that the way that $H_n$ and ${}_tH_n$ vary with $n$ reveals structure in the string; in Section VI we will use ${}_rH_{n\|m}$ to map these correlations.
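The identity ${}_rH_{n\|m} = {}_tH_m - {}_tH_n$ can be checked on a small example. The sketch below computes the relative transition entropy directly as a context-averaged Kullback-Leibler divergence and compares it with the difference of transition entropies. The joint distribution is a toy stationary order-one binary chain with illustrative numbers, and all function names are our own.

```python
from collections import defaultdict
from math import log2

def suffix_marginal(joint, k):
    """Marginalise a distribution over words onto their last k symbols."""
    out = defaultdict(float)
    for word, prob in joint.items():
        out[word[len(word) - k:]] += prob
    return out

def prefix_marginal(dist):
    """Sum out the final symbol, leaving the distribution of contexts."""
    out = defaultdict(float)
    for word, prob in dist.items():
        out[word[:-1]] += prob
    return out

def transition_entropy(joint, m):
    """tH_m: entropy of the last symbol given the m symbols before it."""
    short = suffix_marginal(joint, m + 1)
    context = prefix_marginal(short)
    return -sum(p * log2(p / context[w[:-1]]) for w, p in short.items() if p > 0)

def relative_transition_entropy(joint, m):
    """rH_{n||m} as an averaged KL divergence; n is the word length minus one."""
    short = suffix_marginal(joint, m + 1)
    ctx_long, ctx_short = prefix_marginal(joint), prefix_marginal(short)
    total = 0.0
    for word, p in joint.items():
        if p == 0:
            continue
        tail = word[len(word) - (m + 1):]
        q_long = p / ctx_long[word[:-1]]              # p(x_t | n-symbol context)
        q_short = short[tail] / ctx_short[tail[:-1]]  # p(x_t | m-symbol context)
        total += p * log2(q_long / q_short)
    return total

# Toy stationary order-one binary chain (illustrative numbers only).
P = {"0": {"0": 0.9, "1": 0.1}, "1": {"0": 0.2, "1": 0.8}}
pi = {"0": 2 / 3, "1": 1 / 3}
joint = {a + b + c: pi[a] * P[a][b] * P[b][c]
         for a in "01" for b in "01" for c in "01"}

n = 2
for m in (0, 1):
    direct = relative_transition_entropy(joint, m)
    via_entropies = transition_entropy(joint, m) - transition_entropy(joint, n)
    print(m, direct, via_entropies)
```

Because the toy chain has order one, the $m = 1$ value vanishes: a second symbol of history adds nothing once the first is known.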

IV Example: ${}_rH_{n\|m}$ for the distribution in Eq. (1). $p(X_3|X_1X_2)$ yields the stationary state $p(X_1X_2) = \frac{1}{16}$, so $p(X_2|X_1) = \frac{1}{4}$. By Eq. (3) one finds ${}_rH_{1\|0} = 0$ (comparing $p(X_2|X_1)$ and $p(X_2)$). Comparing $p(X_3|X_1X_2)$ and $p(X_3|X_2)$ shows that ${}_rH_{2\|1} = \frac{1}{4}\log_2[2(\lambda_+)^{\lambda_+}(\lambda_-)^{\lambda_-}]$. I.e. knowing the current state is of no help in predicting the next state (${}_rH_{1\|0} = 0$) but knowing the current and preceding state does help (${}_rH_{2\|1} > 0$ and so ${}_rH_{2\|0} > 0$). Since, for $n > 2$, $m > 1$, ${}_rH_{n\|m} = 0$ one concludes that knowing more than the preceding and current state gives no further predictive advantage.
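A short numerical check of this example is sketched below. It again uses a hypothetical order-two matrix (uniform rows, except that after $AA$, $BB$ the symbols $C$, $D$ have probabilities $\lambda_+$, $\lambda_-$ and after $AB$, $BA$ the reverse), which is our assumption and not necessarily the matrix of Eq. (1). For this assumed matrix the stationary pair distribution is uniform and the relative transition entropy equals $\frac{1}{4}\log_2[2\lambda_+^{\lambda_+}\lambda_-^{\lambda_-}]$.

```python
from itertools import product
from math import log2

SYMS = "ABCD"
d = 0.2                                   # any 0 < d < 1/4
lam_plus, lam_minus = 0.25 + d, 0.25 - d

# Hypothetical p(x3 | x1 x2): uniform except after AA, BB, AB, BA.
p3 = {a + b: dict.fromkeys(SYMS, 0.25) for a, b in product(SYMS, repeat=2)}
for pair in ("AA", "BB"):
    p3[pair].update(C=lam_plus, D=lam_minus)
for pair in ("AB", "BA"):
    p3[pair].update(C=lam_minus, D=lam_plus)

p_pair = 1 / 16   # stationary p(x1 x2); uniform for this assumed matrix
# p(x3 | x2): average over the unseen x1, which is uniform given x2.
p2 = {b: {c: sum(p3[a + b][c] for a in SYMS) / 4 for c in SYMS} for b in SYMS}
# p(x3): average over x2 as well.
p1 = {c: sum(p2[b][c] for b in SYMS) / 4 for c in SYMS}

def kl(q, p):
    """Kullback-Leibler divergence D(q || p) in bits."""
    return sum(q[s] * log2(q[s] / p[s]) for s in q if q[s] > 0)

rH_1_0 = sum(0.25 * kl(p2[b], p1) for b in SYMS)
rH_2_1 = sum(p_pair * kl(p3[a + b], p2[b]) for a, b in product(SYMS, repeat=2))
closed_form = 0.25 * log2(2 * lam_plus ** lam_plus * lam_minus ** lam_minus)
print(rH_1_0, rH_2_1, closed_form)
```

One symbol of history gives no predictive gain, while the second symbol gives a strictly positive one that matches the closed form.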

V Introducing a measure to detect concealed structure. We noted that if the predictions of length $m$ and $m + 1$ strings differ then ${}_rH_{m+1\|m} > 0$; however, since Eq. (3) is an average, small changes in this quantity can hide dramatic changes between the structure of order $m + 1$ and order $m$ transition matrices. One can readily construct examples where there exists a string $x^{*\,t-1}_{\;t-m-1}$ such that $D(p(X_t|x^{*\,t-1}_{\;t-m-1})\|p(X_t|x^{*\,t-1}_{\;t-m})) \gg {}_rH_{m+1\|m}$. It is thus useful to introduce the quantity:

${}_r\hat{H}_{m+1\|m} = \max_{x_{t-m-1}^{t-1}} D(p(X_t|x_{t-m-1}^{t-1})\|p(X_t|x_{t-m}^{t-1}))$, the maximum relative transition entropy over all strings $x_{t-m-1}^{t-1}$ in $\mathcal{A}^{m+1}$. This measures when knowledge of a particular extra symbol imparts a large predictive advantage.
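The gap between the averaged and the maximum measure is easy to exhibit. The sketch below, for the simplest case $m = 0$, uses a toy order-one chain of our own invention in which a rare symbol `c` makes the next symbol certain: the average divergence is small because `c` is rare, while the maximum over contexts is large.

```python
from math import log2

def kl(q, p):
    """Kullback-Leibler divergence D(q || p) in bits."""
    return sum(q[s] * log2(q[s] / p[s]) for s in q if q[s] > 0)

# Toy order-one chain (illustrative): 'c' is rare but fully predictive.
P = {"a": {"a": 0.495, "b": 0.495, "c": 0.01},
     "b": {"a": 0.495, "b": 0.495, "c": 0.01},
     "c": {"a": 1.0, "b": 0.0, "c": 0.0}}

# Stationary distribution p(X_t) by power iteration.
pi = dict.fromkeys(P, 1 / 3)
for _ in range(200):
    pi = {y: sum(pi[x] * P[x][y] for x in P) for y in P}

# D(p(X_t | x_{t-1}) || p(X_t)) for each one-symbol context.
divergences = {x: kl(P[x], pi) for x in P}
average = sum(pi[x] * divergences[x] for x in P)   # averaged measure, rH_{1||0}
maximum = max(divergences.values())                # maximum over contexts
print(average, maximum)
```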

VI Introducing Multiscale Markov Sources. This section introduces a method for describing data from complex systems by fitting finite state machines with memories to each of their different time-scales. Consider coarsening time series to lower and lower time resolutions. For each resolution one might estimate a small, order $m$, transition matrix (Markov source is another name for transition matrix). Let us call these matrices, one for each resolution, a set of multiscale Markov sources. Suppose the real data was generated by a high order Markov source of order $l \gg m$. An order $l$ source (alphabet $\mathcal{A}$) has $|\mathcal{A}|^l(|\mathcal{A}| - 1)$ parameters. By contrast, we will see that a corresponding set of multiscale Markov sources requires only $\sim \log l$ parameters. The sources thus form a compact multiscale representation of the data.

Let us now examine more details of the coarsening. We first break the symbolic series, $y_1^T$, $y_i \in \mathcal{A}$, into consecutive non-overlapping blocks, each $c^r$ symbols long. We fix $c$, the basic block size, and let $r$ vary to give different block sizes $c^r$ (increasing $r$ increases the block size and we will see that this lowers the resolution). Then we coarsen by mapping each possible block (of which there are $|\mathcal{A}|^{c^r}$) onto a single symbol from a smaller set $\mathcal{C}$ ($|\mathcal{C}| < |\mathcal{A}|^{c^r}$). The manner of this map will be discussed below. The new coarsened string at resolution $r$ has $T/c^r$ elements ${}^rx_1^{T/c^r}$ with ${}^rx_i \in \mathcal{C}$. From this string one can estimate an order $m$ Markov source, ${}^r_mM$. Supposing the raw data was generated by a source of order $l$, one might fix $|\mathcal{C}|$, $c$ and $m$ and vary $r$ to give a set of Markov sources ${}^r_mM$ with $r \in \{1, 2, \ldots, \lceil \log_c \frac{l}{m} \rceil\}$. By choosing this range of $r$ values the set of sources has a similar memory to the order $l$ transition matrix. While the order $l$ source needs $|\mathcal{A}|^l(|\mathcal{A}| - 1)$ parameters, the total number of parameters in the multiscale sources is $|\mathcal{C}|^m(|\mathcal{C}| - 1)\lceil \log_c \frac{l}{m} \rceil$. The set of Markov sources ${}^r_mM$ $\forall r \leq \lceil \log_c \frac{l}{m} \rceil$ thus gives a compact multiscale dynamic model for the correlations at each timescale [8] (see top diagram in Fig. 2).
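To make the saving in parameters concrete, the sketch below compares the two counts. It assumes that the number of resolutions is the smallest $R$ with $m c^R \geq l$, i.e. $\lceil \log_c (l/m) \rceil$ for $l \geq m$; this reading of the bound, the function names and the example numbers are our assumptions.

```python
def full_order_params(alphabet_size, l):
    """Free parameters of a single order-l source: |A|^l (|A| - 1)."""
    return alphabet_size ** l * (alphabet_size - 1)

def n_resolutions(l, m, c):
    """Smallest R with m * c**R >= l, i.e. ceil(log_c(l / m)) for l >= m.

    Computed with integers to avoid floating-point logarithm rounding.
    """
    r = 0
    while m * c ** r < l:
        r += 1
    return r

def multiscale_params(codebook_size, m, c, l):
    """Total parameters of the order-m sources over all resolutions."""
    return codebook_size ** m * (codebook_size - 1) * n_resolutions(l, m, c)

# Illustrative numbers: three letters, memory l = 32, order m = 2, c = 2.
print(full_order_params(3, 32))
print(multiscale_params(3, 2, 2, 32))
```

The first count grows exponentially in $l$ while the second grows only with the number of resolutions.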
Such lossy compression lies broadly within rate-distortion theory. Distortion measures capture the lossiness of maps from blocks to single symbols. The following motivational example uses the crude Hamming distortion. Blocks of $c^r$ symbols, each symbol in $\mathcal{A}$, can be viewed as coming from an alphabet $\mathcal{B}_{c,r}$ of size $|\mathcal{A}|^{c^r}$. Call a compression a map $f: \mathcal{B}_{c,r} \to \mathcal{C}$. A map is optimal if a version reconstructed from the compressed string (using an inverse map $g: \mathcal{C} \to \mathcal{B}_{c,r}$) and the original string are as close as possible with respect to a given measure. The Hamming distortion is $d(g(f(X_t)), X_t) = 1 - \delta_{g(f(X_t)), X_t}$ for $X_t \in \mathcal{B}_{c,r}$, i.e. zero when a block is reconstructed exactly and one otherwise. Given $p(X_t)$, the optimal map minimizes the expected symbol-by-symbol distortion between the reconstructed and original letters, $\langle d(g(f(X_t)), X_t) \rangle$. Here, some thought shows that the optimal $f$: (1) takes each of the $|\mathcal{C}|$ most probable symbols in $\mathcal{B}_{c,r}$ to a distinct symbol in $\mathcal{C}$ and (2) takes all other symbols to an arbitrary symbol in $\mathcal{C}$.
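The two-part rule for the Hamming-optimal map can be sketched as follows, using empirical block frequencies in place of $p(X_t)$. The function names and the example string are ours, and the "arbitrary" symbol for rare blocks is simply taken to be code 0.

```python
from collections import Counter

def hamming_optimal_map(blocks, n_codes):
    """Build f (block -> code) and g (code -> block) from block frequencies."""
    ranked = [block for block, _ in Counter(blocks).most_common()]
    # (1) each of the n_codes most frequent blocks gets its own code
    f = {block: code for code, block in enumerate(ranked[:n_codes])}
    g = {code: block for block, code in f.items()}
    # (2) every other block is sent to an arbitrary code (here: code 0)
    for block in ranked[n_codes:]:
        f[block] = 0
    return f, g

def expected_distortion(blocks, f, g):
    """Fraction of blocks that are not reconstructed exactly."""
    return sum(g[f[block]] != block for block in blocks) / len(blocks)

series = "ABABABBAABABCDAB"
blocks = [series[i:i + 2] for i in range(0, len(series), 2)]
f, g = hamming_optimal_map(blocks, n_codes=2)
print(f, expected_distortion(blocks, f, g))
```

With two codes, the frequent block `AB` and one of the two rare blocks are reconstructed exactly, and only the remaining rare block contributes to the distortion.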



[Figure 1. Axis labels: Time (s) ($m\tau$ secs); N-N intervals (s); V-V intervals (s); Time (s) ($m\tau'$ secs).]

FIG. 1: A patient with cardiac arrhythmia: a) The $m$th bar gives the predictive advantage of knowing $m\tau$ seconds of past activity over knowing $m\tau - \tau$ seconds of activity (${}_rH_{m\|m-1}$). $\tau = 32$ ms and $t < 35\tau$ (more than 99% of all length $35\tau$ strings, picked uniformly at random from the data, occurred more than 300 times). b) [...] for the original data. c,d) The intervals between successive normal and ventricular beats respectively. e) ${}_rH_{m\|m-1}$ for a coarsened string $r = 1$, $c = 12$, $|\mathcal{C}| = 10$; each timestep is thus $\tau' = 0.384$ s. For length 4 and 5 strings in the coarsened data, 97% and 90% respectively of all such strings picked uniformly at random occurred more than 300 times.

VII Example: Sudden Cardiac Death. This section applies the above tools to heart arrhythmia. A simplified view of the heart is that, in any short interval, it can have a normal (N) or ventricular beat (V) or no beat at all (0). The raw data is a list of times of beats labeled as N or V [9]. By discretizing time into blocks of $\tau = 32$ ms this list was converted into a symbolic string of the form '0N0V00N...'. The following uses 24 hours of heart data for a patient with many V beats. Given this three-letter alphabet one can attempt to estimate a transition matrix of order $m$: $p(X_t|x_{t-m}^{t-1})$ $\forall X_t, x_{t-m}^{t-1}$ [10]. Since $\tau \ll$ interbeat interval ($\sim 0.85$ s) and the system is very structured, the size of the transition matrix grows slowly with $m$. Accommodation of finite size effects in the estimation of such transition matrices is delicate and a crude approach was used here (partly justified by the wealth of data). The transition matrices can only give information about the data as a whole (rather than the behavior of the heart at any one time) as, alongside the presence of multiscale nonstationarities, this patient was particularly unhealthy. Fig. 1a shows ${}_rH_{m\|m-1}$ for the patient. As $m\tau$ (the duration of the string one is given) increases towards the N-N beat interval (see Fig. 1c), one's ability to make good predictions increases markedly. But, when $m\tau$ nears the heart beat interval, further knowledge gives less predictive advantage (because one is already equipped with knowledge of a characteristic time period of the process). As a result ${}_rH_{m\|m-1}$ begins to fall around 0.7 s, mirroring the distribution in Fig. 1c. Beats with intervals > 1 s are rare so ${}_rH_{m\|m-1}$ for $m\tau > 1$ s is small. Fig. 1b plots ${}_r\hat{H}_{m\|m-1}$ (with the strong proviso that all strings considered occurred more than 300 times in the data). It reveals hidden structure between 0.2 and 0.4 s. This peak is the compound effect of short V-V events and misannotations in the uncorrected record (see Fig. 1d). The coarsened data, Fig. 1e, reveals structure on another timescale: one sees that a large part of the predictive knowledge is contained in the first second of activity but another characteristic timescale, open to physiological interpretation, appears in the range 1.19-1.54 s. Plots like Fig. 1 might distinguish between heart conditions, since these can depend on dynamics of a few seconds [12].

VIII Continuous time series. Multiscale Markov sources can also be found for continuous time series (e.g. wind speed) as well as symbolic strings. The data, sampled at $T$ times, is again broken up into blocks of $c^r$ consecutive points and coarsened. The alphabet we compress to is now a set of $|\mathcal{C}|$ letters with each one representing a different motif of $c^r$ consecutive reals. We allocate each block of raw data to its closest motif using a mean-square distance. For example, suppose we want to compress blocks of two data points ($c = 2$, $r = 1$) to one of three symbols and we are given a three letter code book with three motifs: $\mathcal{C} = \{N = (1, 1), V = (0.1, 2), U = (10, 10)\}$. Using the mean-square distance, the sequence of continuous data ${}^0y_1^T = \ldots|1\ 1.1|0.2\ 1.5|1.1\ 1.01|\ldots$ is optimally represented by ${}^1x_1^{T/2} = \ldots|N|V|N|\ldots$ See Fig. 2. The optimal set of motifs, for a fixed $|\mathcal{C}|$, allows the compressed sequence to reconstruct the original with minimum total mean-square error [13]. A selection of algorithms exists for finding optimal motifs (vector quantizers). In our case, such algorithms have as input the set of $T/c^r$ blocks of length $c^r$ and the value of $|\mathcal{C}|$. They output the set, $\mathcal{C}$, of motifs of length $c^r$ which minimizes the total mean-square error for this block size. Given these motifs one can then convert the continuous time series into its closest symbolic equivalent (see the above example: ${}^0y_1^T \to {}^1x_1^{T/2}$; see Fig. 2). Given this string of symbols generated from blocks of $c^r$ real valued data points, one again determines ${}^r_mM$ $\forall r$.

FIG. 2: Above: schematic of vector quantization (${}^0y_1^T \to {}^rx_1^{T/c^r}$) and multiscale Markov sources ${}^r_1M$. Main: The black, grey and white bars are the means, ± standard errors, of groups of patients who were healthy, experiencing congestive heart failure and atrial fibrillation respectively. Unphysiological and ectopic beat intervals were filtered and hourly trends removed. The fourth, hatched, columns are for random phased $1/f$ noise. The x axis considers the beat interval data at five different resolutions $r = 0 \ldots 4$, $c = 2$. At each resolution ${}_rH_{1\|0}$ is found; this measures the extra predictive power from knowing one symbol over knowing none. Gaussian white noise has ${}_rH_{1\|0} \to 0$ $\forall r$ and is not plotted (Brownian noise has ${}_rH_{2\|1} \to 0$ $\forall r$).
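The allocation of blocks to their closest motif in Section VIII can be sketched in a few lines. The code book is the three-motif example of the text, with the example data read as the blocks (1, 1.1), (0.2, 1.5), (1.1, 1.01); the function name `quantize` is ours.

```python
CODEBOOK = {"N": (1.0, 1.0), "V": (0.1, 2.0), "U": (10.0, 10.0)}

def quantize(series, codebook, block_len):
    """Map consecutive non-overlapping blocks to their nearest motif.

    Distance is the summed squared difference between block and motif;
    a trailing partial block, if any, is dropped.
    """
    symbols = []
    for start in range(0, len(series) - block_len + 1, block_len):
        block = series[start:start + block_len]
        nearest = min(
            codebook,
            key=lambda name: sum((a - b) ** 2 for a, b in zip(block, codebook[name])),
        )
        symbols.append(nearest)
    return "".join(symbols)

print(quantize([1, 1.1, 0.2, 1.5, 1.1, 1.01], CODEBOOK, 2))  # NVN
```

The middle block (0.2, 1.5) lies closer to V than to N (squared distances 0.26 and 0.89), which gives the string N, V, N.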

IX Example. Fig. 2 applies this idea to three different groups of cardiac patients, using the Generalized Lloyd algorithm to find the appropriate set of motifs, $\mathcal{C}$, for each resolution, $r$. The raw cardiac data was an 'interval series': each data point being the time interval between successive heart beats. The matrix ${}^r_1M$ was found for $r = 0 \ldots 4$, $c = 2$, $|\mathcal{C}| = 10$ and the extra predictive power from knowing one symbol was estimated: ${}_rH_{1\|0}$. The three different heart conditions can be distinguished (for comparable results see [14]); healthy hearts show a slow loss in predictability, the disordered beats that occur in atrial fibrillation yield a low degree of predictability whereas congestive heart failure shows an increase in predictability at some scales.
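For readers unfamiliar with motif finding, a Lloyd-type iteration can be sketched as below. It alternates between assigning each block to its nearest motif and replacing each motif by the mean of its assigned blocks. This is a minimal illustration with a deterministic initialisation of our own choosing and made-up data, not the implementation used for Fig. 2.

```python
def sq_dist(u, v):
    """Summed squared difference between two equal-length blocks."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def lloyd(blocks, n_motifs, iters=50):
    """Lloyd-type iteration returning n_motifs motifs for the given blocks."""
    # Deterministic start: evenly spaced blocks (an arbitrary choice).
    motifs = [blocks[i * len(blocks) // n_motifs] for i in range(n_motifs)]
    for _ in range(iters):
        groups = [[] for _ in motifs]
        for block in blocks:
            nearest = min(range(len(motifs)), key=lambda k: sq_dist(block, motifs[k]))
            groups[nearest].append(block)
        # Each motif becomes the mean of its group; empty groups keep their motif.
        updated = [
            tuple(sum(col) / len(group) for col in zip(*group)) if group else motif
            for group, motif in zip(groups, motifs)
        ]
        if updated == motifs:   # assignments have stopped changing
            break
        motifs = updated
    return motifs

blocks = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2),
          (10.0, 10.0), (10.2, 10.0), (10.0, 10.2)]
motifs = lloyd(blocks, n_motifs=2)
print(motifs)
```

Such iterations converge to a local minimum of the total squared error, so in practice the result can depend on the initial motifs.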

X Conclusion. This paper presented a means of producing a one parameter map of predictive knowledge acquired as one is equipped with increasingly long substrings of a symbolic data set. It suggests that lossy compression allows this short range mapping technique to be compactly extended to the study of longer ranged correlations. Examples are given where different heart conditions are distinguished and characterized using these methods. Underlying this work is the view that dynamical signatures of some systems can be found by treating them as sets of Markov sources with each source characterizing dynamics on a different time-scale.

Thanks to M Costa, A Goldberger, C-F Lee and C-K 
Peng 